ABSTRACT

In this paper, we propose a new approach for fast processing 
of SPARQL queries on large RDF datasets containing RDF 
quadruples (or quads). Our approach called RIQ employs 
a decrease-and-conquer strategy: Rather than indexing the 
entire RDF dataset, RIQ identifies groups of similar RDF 
graphs and indexes each group separately. During query 
processing, RIQ uses a novel filtering index to first identify 
candidate groups that may contain matches for the query. 
On these candidates, it executes optimized queries using a 
conventional SPARQL processor to produce the final results. 
Our initial performance evaluation results are promising: 
Using a synthetic and a real dataset, each containing about 
1.4 billion quads, we show that RIQ outperforms RDF-3X 
and Jena TDB on a variety of SPARQL queries. 

1. INTRODUCTION 

The Resource Description Framework (RDF) is a standard
model for representing data on the Web. It enables the
interchange and machine processing of data by considering
its semantics. While RDF was first proposed with the vision
of enabling the Semantic Web, it has now become popular in
domain-specific applications and the Web. Through advanced
RDF technologies, one can perform semantic reasoning over
data and extract knowledge in domains such as healthcare,
biopharmaceuticals, defense, and intelligence. Linked Data
is a popular use case of RDF on the Web; it has a large
collection of different knowledge bases, which are
represented in RDF (e.g., DBpedia [10]).

With a growing number of new applications relying on
Semantic Web technologies (e.g., Pfizer [4]; Newsweek, BBC,
The New York Times, and Best Buy) and the availability of
large RDF datasets (e.g., Billion Triples Challenge (BTC),
Linking Open Government Data (LOGD)), there is a need to
advance the state-of-the-art in storing, indexing, and query
processing of RDF datasets.

*An extended version of this WebDB 2014 paper has been
published in the Journal of Web Semantics (JWS). (DOI:
http://dx.doi.org/10.1016/j.websem.2016.03.005)

Today, datasets containing over a billion RDF quads are
becoming popular on the Web (e.g., BTC, LOGD). Such datasets
can be viewed as a collection of RDF graphs. Using SPARQL's
GRAPH keyword, one can pose a query to match a specific
graph pattern within any single RDF graph. While researchers
in the database community have proposed scalable approaches
for indexing and query processing of large RDF datasets,
they have designed these techniques for RDF datasets
containing triples. In addition, none of them have
investigated how large and complex graph patterns in SPARQL
queries can be processed efficiently. Evidently, RDF-3X, a
popular scalable approach for a local/centralized
environment, yields poor performance when SPARQL queries
containing large graph patterns are processed over large RDF
datasets. This is because of the large number of join
operations that must be performed to process a query.

We posit that, on RDF datasets containing billions of quads,
any approach that first finds matches for subpatterns in a
large graph pattern and then employs join operations to
merge partial matches will face a similar limitation.
Motivated by the aforementioned reasons, we propose a new
approach called RIQ (RDF Indexing on Quads) and make the
following contributions in this paper:

• We propose a new vector representation for RDF graphs and
graph patterns in SPARQL queries. This representation
enables us to group similar RDF graphs and index each group
separately rather than constructing an index on the entire
dataset. We propose a novel filtering index, which is stored
compactly using a combination of Bloom Filters and Counting
Bloom Filters.

• We propose a decrease-and-conquer approach to efficiently
process a SPARQL query. Using the filtering index, we can
methodically and quickly identify candidate groups of RDF
graphs that may contain a match for the query. We can then
execute optimized queries on the candidates using a
conventional SPARQL processor that supports quads.

• We report the results from our initial performance
evaluation using a synthetic and a real dataset, each
containing about 1.4 billion quads. We observed that RIQ can
outperform RDF-3X and Jena TDB on a variety of SPARQL
queries.

2. BACKGROUND 

The RDF data model provides a simple way to represent any
assertion as a (subject, predicate, object) triple. A
collection of triples can be modeled as a directed, labeled
graph. If each triple has a graph name (or context), it is
called a quad. Below is an example of a quad from the BTC
2012 dataset with its subject, predicate, object, and
context:

<http://data.linkedmdb.org/resource/producer/10138>
<http://data.linkedmdb.org/resource/movie/producer_name>
"Mani Ratnam"
<http://data.linkedmdb.org/data/producer/10138>

Triples with the same context belong to the same RDF graph.

PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?g ?producer ?name ?label ?page ?film WHERE {
  GRAPH ?g {
    ?producer movie:producer_name ?name .
    ?producer rdfs:label ?label .
    OPTIONAL { ?producer foaf:page ?page . } .
    ?film movie:producer ?producer .
  }
}

Figure 1: An example of a SPARQL query

Query => 'SELECT' Variables 'WHERE' '{' 'GRAPH' Variables
    '{' GroupGraphPattern '}' '}' ResultModifiers
GroupGraphPattern => BGP? ( GraphPatternNotTriples '.'? BGP? )*
GraphPatternNotTriples =>
    GroupOrUnionGraphPattern | OptionalGraphPattern | Filter
GroupOrUnionGraphPattern =>
    GroupGraphPattern ( 'UNION' GroupGraphPattern )*
OptionalGraphPattern => 'OPTIONAL' GroupGraphPattern
Filter => 'FILTER' Constraint
Constraint => Predicate | 'EXISTS' BGP | 'NOT EXISTS' BGP

Using SPARQL, one can express complex graph pattern queries
on RDF graphs. One of the fundamental operations in RDF
query processing is Basic Graph Pattern Matching. A Basic
Graph Pattern (BGP) in a query combines a set of triple
patterns. A triple pattern contains variables (prefixed by
?) and constants. During query processing, the variables in
a BGP are bound to RDF terms in the data, i.e., the nodes in
the same RDF graph, via subgraph matching. Common variables
within a BGP or across BGPs denote a join operation on the
variable bindings of triple patterns. Consider the query
shown in Figure 1. The bindings for the subject (variable)
in ?producer movie:producer_name ?name are joined with the
bindings for the object (variable) in ?film movie:producer
?producer. The variable ?g will be bound to the
names/contexts of those RDF graphs that contain a match for
the graph pattern specified inside the GRAPH block. OPTIONAL
allows certain patterns to have empty bindings; UNION
combines bindings of multiple graph patterns.
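To make the join semantics concrete, the following is a minimal sketch of naive BGP matching restricted to one RDF graph at a time. It is not how RIQ or any of the cited systems evaluate queries; the function names (`match_triple`, `match_bgp`, `query_quads`) and the tuple-based data layout are assumptions made for illustration only.

```python
from collections import defaultdict

def is_var(term):
    return term.startswith("?")

def match_triple(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # A shared variable must keep the value it already has (the join).
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new

def match_bgp(bgp, triples):
    """Return all variable bindings of `bgp` over the triples of ONE graph."""
    bindings = [{}]
    for pattern in bgp:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return bindings

def query_quads(bgp, quads):
    """Group quads by context, then match the BGP inside each graph."""
    graphs = defaultdict(list)
    for s, p, o, c in quads:
        graphs[c].append((s, p, o))
    return {c: m for c, ts in graphs.items() if (m := match_bgp(bgp, ts))}

# Hypothetical toy data: only graph g1 holds both required triples.
quads = [
    ("prod1", "movie:producer_name", "Mani Ratnam", "g1"),
    ("film1", "movie:producer", "prod1", "g1"),
    ("film2", "movie:producer", "prod1", "g2"),
]
bgp = [("?producer", "movie:producer_name", "?name"),
       ("?film", "movie:producer", "?producer")]
result = query_quads(bgp, quads)
```

Because the quads are grouped by context before matching, a binding can never combine triples from different graphs, which is the behavior the GRAPH keyword requires.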


3. RELATED WORK AND MOTIVATION 

Several approaches have been developed for indexing and
querying RDF data in a local/centralized environment. Early
approaches employed an RDBMS to store and query RDF data
(e.g., Sesame, Oracle [17]). Unfortunately, the cost of
self-joins on a single (triples) table became a serious
bottleneck. Later, Abadi et al. proposed the idea of
vertically partitioning the property tables [29] and used a
column-oriented DBMS to achieve an order of magnitude
performance improvement over previous techniques. Recently,
Neumann et al. developed RDF-3X [22] that builds exhaustive
indexes on the six permutations of (s, p, o) triples. RDF-3X
significantly outperformed the vertical partitioning
approach. It uses a new join ordering method based on
selectivity estimates and builds compressed indexes. Weiss
et al. [28] developed Hexastore that also builds exhaustive
indexes. However, Hexastore suffers from large index sizes
due to lack of compression. Atre et al. developed BitMat to
overcome the overhead of large intermediate join results for
queries containing low selectivity triple patterns. BitMat
performs in-memory processing of compressed bit matrices
during query processing.


More recently, Bornea et al. [13] developed DB2RDF by using
an RDBMS to store and query RDF data. By storing the
predicate-object pairs of each subject in the same row of
the relational table, they reduced the number of joins
required for star-shaped BGPs. DB2RDF maintains only subject
and object indexes and employs a novel SPARQL-to-SQL
translation technique for generating optimized queries. Yuan
et al. developed TripleBit, which uses a compact storage
scheme for RDF data by representing triples via a Triple
Matrix. For each predicate, TripleBit maintains SO and OS
ordered buckets. Using a collection of indexes and optimal
join ordering, it reduces the size of the intermediate
results during query processing.

Figure 2: Grammar for queries

A few approaches exploit the graph properties/structure of
RDF data for indexing and query processing [25, 27, 14, 32,
23]. These techniques, however, have been tested only on
small RDF datasets containing less than 50 million triples.
Recently, a few schemes were proposed for
distributed/parallel RDF query processing [20, 31]. Our
work, however, focuses on RDF query processing in a local
environment.

The motivation for our work stems from two key
observations: First, the above approaches were designed to
process RDF datasets containing triples. Simply ignoring the
context in an RDF quad and using an existing approach
designed for triples may produce incorrect results due to
bindings for a BGP from different graphs. Second, most of
the queries tested by these approaches contain BGPs with a
modest number of triple patterns (at most 8). None of them
have investigated how to efficiently process SPARQL queries
with large, complex BGPs (e.g., containing undirected
cycles^1).


4. THE DESIGN OF RIQ

In this section, we present the design of RIQ (RDF Indexing
on Quadruples) and describe its three main components,
i.e., the Indexing Engine, the Filtering Engine, and the
Execution Engine. (See Figure 3.) Our goal is to support a
subset of the SPARQL grammar as shown in Figure 2.

^1 Here is a simple example: {?a p ?b . ?b q ?c . ?a r ?c .}.

Transformation f_D               Transformation f_Q
f_D(SPO, (s,p,o)) = (s,p,o)      f_Q('s p o') = (SPO, (s,p,o))
f_D(SP?, (s,p,o)) = (s,p,?)      f_Q('s p ?v_o') = (SP?, (s,p,?))
f_D(S?O, (s,p,o)) = (s,?,o)      f_Q('s ?v_p o') = (S?O, (s,?,o))
f_D(?PO, (s,p,o)) = (?,p,o)      f_Q('?v_s p o') = (?PO, (?,p,o))
f_D(S??, (s,p,o)) = (s,?,?)      f_Q('s ?v_p ?v_o') = (S??, (s,?,?))
f_D(?P?, (s,p,o)) = (?,p,?)      f_Q('?v_s p ?v_o') = (?P?, (?,p,?))
f_D(??O, (s,p,o)) = (?,?,o)      f_Q('?v_s ?v_p o') = (??O, (?,?,o))

Table 1: Transformations in RIQ


4.1 Indexing RDF Data 

We introduce a new vector representation for RDF graphs and
BGPs, which allows us to capture the properties of the
triples and triple patterns in them. This vector
representation plays a key role in the construction of an
effective filtering index, where similar RDF graphs will be
grouped together.

4.1.1 Essential Transformations 

To begin with, we define two transformations: one for a
triple in an RDF graph and the other for a triple pattern in
a BGP. Let $P = \{SPO, SP?, S?O, ?PO, S??, ?P?, ??O\}$ be a
set of canonical patterns. We denote the transformation on a
triple $(s,p,o)$ by $f_D : P \times \{(s,p,o)\} \to O_D$,
where the range $O_D$ is shown in Table 1 for each canonical
pattern. Note that $O_D$ resembles triple patterns (variable
names excluded) that can appear in a BGP.

Next, we denote a transformation $f_Q : T \to P \times O_Q$,
where $T$ denotes the set of triple patterns that can appear
in a query. The range $P \times O_Q$ is shown in Table 1 and
identifies the canonical pattern for a given triple pattern.
Although the triple pattern 's p o' has no variables, it is
still a valid triple pattern in a BGP.^2

The transformations $f_D$ and $f_Q$ allow us to map a triple
in the data and a triple pattern in a query to a common
plane of reference. This will enable us to quickly test if a
triple pattern in a BGP has a match in the data.
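The two transformations can be sketched in a few lines of Python. This is only an illustration of Table 1, assuming that triples and triple patterns are given as 3-tuples of strings and that variables start with `?`; the function names `f_D` and `f_Q` mirror the notation of the text, while the tuple encoding is an assumption.

```python
CANONICAL = ["SPO", "SP?", "S?O", "?PO", "S??", "?P?", "??O"]

def f_D(r, triple):
    """Mask the positions of a data triple that are '?' in canonical pattern r."""
    return tuple("?" if flag == "?" else term for flag, term in zip(r, triple))

def f_Q(triple_pattern):
    """Map a triple pattern to (canonical pattern, masked tuple)."""
    r = "".join("?" if term.startswith("?") else letter
                for letter, term in zip("SPO", triple_pattern))
    if r not in CANONICAL:
        # A pattern with three variables is outside Table 1.
        raise ValueError(f"unsupported triple pattern: {triple_pattern}")
    masked = tuple("?" if term.startswith("?") else term
                   for term in triple_pattern)
    return r, masked

# A data triple and a query pattern meet on the same masked tuple.
triple = ("film1", "movie:producer", "prod1")
r, masked = f_Q(("?film", "movie:producer", "?producer"))
same_plane = f_D(r, triple) == masked
```

The last line shows the "common plane of reference": applying `f_D` with the canonical pattern returned by `f_Q` yields a value that can be compared directly with the masked query pattern.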

4.1.2 Pattern Vectors 

Given an RDF graph with context $c$, we map it into a vector
representation called a Pattern Vector (PV) and denote it by
$V_c$. Essentially, $V_c = (V_{c,SPO}, V_{c,SP?},
V_{c,S?O}, V_{c,?PO}, V_{c,S??}, V_{c,?P?}, V_{c,??O})$,
where each $V_{c,r}$ denotes the vector constructed for
$r \in P$. We assume a hash function $H : B \to
\mathbb{Z}^*$, where $B$ denotes a bit string and the range
is the set of non-negative integers. Now, we construct $V_c$
as follows: Initially, each $V_{c,r}$ is empty. Given a quad
$(s,p,o,c)$ in the graph, for each $r \in P$, we compute
$H(f_D(r, (s,p,o)))$ and insert it into $V_{c,r}$. We
perform this computation on every quad in the graph to
generate $V_c$. Note that $V_c$ requires space linear in the
number of quads in the graph.

Our hash function $H$ is based on Rabin's fingerprinting
technique [24], which is efficient to compute. If we
generate 32-bit hash values, the probability of collision is
extremely low. Thus, in practice, we can view $V_{c,SPO}$ as
a set, because the quads/triples in a graph are always
assumed to be unique. However, the remaining vectors of
$V_c$ should be viewed as multisets, because $f_D$ can
produce the same output for different triples due to the
presence of '?' in the output.

^2 SELECT ?g WHERE { GRAPH ?g { s p o . } }.


Given a BGP $q$, we map it into a PV, denoted by $V_q$, and
compute it slightly differently: Initially, each $V_{q,r}$
is empty. For each triple pattern $t$ in $q$, we compute
$f_Q(t)$ to produce a pair $(r, o)$, where $r$ denotes the
canonical pattern for $t$. We then insert $H(o)$ into
$V_{q,r}$. As before, $V_{q,SPO}$ can be viewed as a set.
The rest of the vectors of $V_q$ should be viewed as
multisets, because two different triple patterns (each
containing at least one variable) in a BGP may hash to the
same value. For example, if a BGP contains two triple
patterns ?s1 movie:producer ?o1 and ?s2 movie:producer ?o2,
then $f_Q$('?s1 movie:producer ?o1') = $f_Q$('?s2
movie:producer ?o2') and therefore, the hash values produced
by $H$ will be identical.
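The construction of PVs for a graph and for a BGP can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it uses CRC32 from the Python standard library as a stand-in 32-bit hash (the paper uses Rabin fingerprinting), represents each vector as a `Counter` so that multiset multiplicities are kept, and assumes string 3-tuples with `?`-prefixed variables.

```python
import zlib
from collections import Counter

CANONICAL = ["SPO", "SP?", "S?O", "?PO", "S??", "?P?", "??O"]

def h(masked):
    # Stand-in 32-bit hash (CRC32); an assumption replacing Rabin fingerprints.
    return zlib.crc32("\x1f".join(masked).encode("utf-8"))

def mask(r, triple):
    return tuple("?" if flag == "?" else term for flag, term in zip(r, triple))

def pv_of_graph(triples):
    """Each triple contributes one hash value to every one of the 7 vectors."""
    pv = {r: Counter() for r in CANONICAL}
    for triple in triples:
        for r in CANONICAL:
            pv[r][h(mask(r, triple))] += 1
    return pv

def pv_of_bgp(patterns):
    """Each triple pattern contributes to exactly one vector, chosen by f_Q.
    Patterns with three variables are not covered by this sketch."""
    pv = {r: Counter() for r in CANONICAL}
    for pat in patterns:
        r = "".join("?" if t.startswith("?") else c for c, t in zip("SPO", pat))
        pv[r][h(mask(r, pat))] += 1
    return pv

# The two patterns from the text collapse to one hash value with count 2.
pv_q = pv_of_bgp([("?s1", "movie:producer", "?o1"),
                  ("?s2", "movie:producer", "?o2")])
```

Keeping counts rather than plain sets is what lets the later containment test distinguish a BGP that needs two `movie:producer` triples from one that needs only one.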

4.1.3 Operations on Pattern Vectors 

Next, we define two operations on PVs, which will be used 
during the construction of the filtering index. Our goal is 
to group similar PVs (and as a result, similar RDF graphs) 
together so that candidate RDF graphs are identified and 
processed quickly during query processing. 

Definition 1 (Union). Given two PVs, say $V_a$ and $V_b$,
their union $V_a \cup V_b$ is a PV, say $V_c$, where
$V_{c,r} = V_{a,r} \cup V_{b,r}$ and $r \in P$.

Definition 2 (Similarity). Given two PVs, say $V_a$ and
$V_b$, their similarity is denoted by $sim(V_a, V_b) =
\max_{r \in P} sim(V_{a,r}, V_{b,r})$, where
$sim(V_{a,r}, V_{b,r}) = \frac{|V_{a,r} \cap V_{b,r}|}
{|V_{a,r} \cup V_{b,r}|}$.
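The two operations can be sketched with Python's `Counter` type. Two assumptions are made here and should be read as such: per-vector similarity is taken to be the Jaccard index (which the later LSH discussion suggests), and multiset union keeps the maximum multiplicity of each hash value. The names `pv_union`, `jaccard`, and `pv_similarity` are hypothetical.

```python
from collections import Counter

def pv_union(va, vb):
    """Component-wise multiset union (max of multiplicities, an assumption)."""
    return {r: va[r] | vb[r] for r in va}

def jaccard(a, b):
    """Jaccard index on multisets: |a & b| / |a | b|, and 0.0 if both are empty."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def pv_similarity(va, vb):
    """Similarity of two PVs: the best similarity over all canonical patterns."""
    return max(jaccard(va[r], vb[r]) for r in va)

# Toy PVs restricted to two patterns, with small integers as hash values.
va = {"SPO": Counter({1: 1}), "?P?": Counter({7: 2, 8: 1})}
vb = {"SPO": Counter({2: 1}), "?P?": Counter({7: 1, 9: 1})}
u = pv_union(va, vb)
s = pv_similarity(va, vb)
```

With these toy values the SPO vectors share nothing, while the `?P?` vectors share one of four elements, so the maximum is taken from `?P?`.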

4.1.4 Index Construction 

We begin by describing a key necessary condition, which
forms the basis for indexing and query processing in RIQ.
Because we map both the RDF graphs and BGPs into their PVs,
we must characterize the relationship between them when
processing a BGP (assuming it is a connected graph) via
subgraph matching. We state the following theorem.

Theorem 1. Suppose $V_c$ and $V_q$ denote the PVs of an RDF
graph and a BGP, respectively. If the BGP has a subgraph
match in the RDF graph, then $\bigwedge_{r \in P} (V_{q,r}
\subseteq V_{c,r}) = \text{TRUE}$. (See the technical report
[26] for the proof.)

According to Theorem 1, given a BGP, if we can identify
those RDF graphs in the database whose PVs satisfy the
necessary condition, then we have a superset of the RDF
graphs that contain a subgraph match for the BGP. This also
guarantees that there are no false dismissals.
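The necessary condition of the theorem amounts to a containment test per canonical pattern. The sketch below assumes that PVs are dictionaries mapping a canonical pattern to a `Counter` of hash values; `contains` and `may_match` are hypothetical names, and the toy hash values are made up for illustration.

```python
from collections import Counter

def contains(big, small):
    """Multiset containment: every value in `small` occurs at least as often in `big`."""
    return all(big[value] >= count for value, count in small.items())

def may_match(pv_graph, pv_bgp):
    """True if the graph passes the necessary condition for the BGP.
    A True result only makes the graph a candidate; it is not a proof of a match."""
    return all(contains(pv_graph[r], pv_bgp[r]) for r in pv_bgp)

# Toy example: the graph has three triples hashing to 7 under pattern ?P?.
pv_graph = {"SPO": Counter(), "?P?": Counter({7: 3})}
candidate = may_match(pv_graph, {"SPO": Counter(), "?P?": Counter({7: 2})})
rejected = may_match(pv_graph, {"SPO": Counter(), "?P?": Counter({7: 4})})
```

A graph that fails the test can be discarded safely, whereas a graph that passes still has to be checked by a SPARQL processor.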

Rather than testing every PV in the database one at a time
during query processing, we propose a novel filtering index
called the PV-Index to effectively organize millions of PVs
in the database. Using this index, we aim to quickly
identify candidate RDF graphs in the early stages of query
processing using Theorem 1. Our goal is to discard most of
the non-matching RDF graphs without any false dismissals. As
a result, the subsequent stages of query processing will
process fewer candidates to obtain the final results,
thereby speeding up query processing.

There are two issues that arise while designing the
PV-Index: First, we want to group similar PVs together so
that for a given BGP, we can quickly discard most of the
non-matching RDF graphs. Second, we want to compactly store
the PV-Index to minimize the cost of I/O during query
processing. To address the first issue, we use the concept
of locality sensitive hashing (LSH). For similarity on sets
based on the Jaccard index, LSH on a set $S$, denoted by
$LSH_{k,l,m}(S)$, can be performed as follows: Pick
$k \times l$ random linear hash functions of the form
$h(x) = (ax + b) \bmod u$, where $u$ is a prime, and $a$ and
$b$ are integers such that $0 < a < u$ and $0 < b < u$.
Compute $g(S) = \min\{h(x)\}$ over all items in the set as
the output hash value for $S$. Each group of $l$ hash values
is hashed (e.g., using Rabin's fingerprinting) to the range
$[0, m - 1]$. This results in $k$ hash values for $S$. It is
known that given two sets $S_1$ and $S_2$ with similarity
$p$, $\Pr[g(S_1) = g(S_2)] = p$. Also, the probability that
$LSH_{k,l,m}(S_1)$ and $LSH_{k,l,m}(S_2)$ have at least one
hash value identical is $1 - (1 - p^l)^k$. The above
properties also hold for multisets.
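The LSH procedure described above can be sketched as follows. The modulus, the seeded random generator, and the use of CRC32 to hash each group of `l` minima are assumptions chosen to keep the sketch self-contained; the text itself suggests Rabin fingerprinting for the last step. Items are assumed to be non-negative integers (for example, the hash values stored in a PV), and the input is assumed to be non-empty.

```python
import random
import zlib

U = 2_147_483_647  # the prime 2^31 - 1, an assumed choice for the modulus u

def make_lsh(k, l, m, seed=0):
    """Return a function computing k hash values in [0, m-1] for a set of ints."""
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, U), rng.randrange(1, U)) for _ in range(k * l)]

    def lsh(items):
        # One min-hash g(S) per linear function h(x) = (a*x + b) mod u.
        mins = [min((a * x + b) % U for x in items) for a, b in funcs]
        out = []
        for g in range(k):
            group = mins[g * l:(g + 1) * l]
            digest = zlib.crc32(",".join(map(str, group)).encode("ascii"))
            out.append(digest % m)
        return out

    return lsh

def collision_probability(p, k, l):
    """Probability that two sets with similarity p share at least one LSH value."""
    return 1 - (1 - p ** l) ** k

lsh = make_lsh(k=4, l=2, m=1000)
signature = lsh({11, 42, 97})
```

Increasing `l` makes each of the `k` values more selective, while increasing `k` gives similar sets more chances to collide; `collision_probability` evaluates the closed form quoted in the text.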

To address the second issue, we employ Bloom filters (BFs)
and Counting Bloom filters (CBFs) to compactly represent the
PV-Index. A Bloom filter is a popular data structure to
compactly represent a set of items and process membership
queries on it. A Counting Bloom filter maintains n-bit
counters instead of single bits and can represent multisets.
Both BFs and CBFs can be configured to achieve a false
positive rate based on their capacities [15].
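For readers unfamiliar with these structures, here is a minimal Counting Bloom filter sketch. It is not the library used by RIQ: the class name, the double-hashing scheme based on SHA-256, and the unbounded Python integers used as counters are all assumptions. The sizing uses the standard formulas m = -n ln(e) / (ln 2)^2 and k = (m / n) ln 2 for capacity n and false positive rate e. A plain Bloom filter is the special case in which each counter is a single bit.

```python
import hashlib
import math

class CountingBloomFilter:
    def __init__(self, capacity, fp_rate):
        n = max(1, capacity)
        # Standard sizing for a target false positive rate.
        self.m = max(1, math.ceil(-n * math.log(fp_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.counters = [0] * self.m

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of a digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        """Upper bound on how many times `item` was added."""
        return min(self.counters[pos] for pos in self._positions(item))

    def __contains__(self, item):
        return self.count(item) > 0

cbf = CountingBloomFilter(capacity=100, fp_rate=0.05)
cbf.add(12345)
cbf.add(12345)
```

The filter never under-counts, so a membership or count test can produce false positives but no false negatives, which is the property the filtering index relies on.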


Algorithm 1 The PV-Index Construction

Input: a list of PVs; (k, l, m): LSH parameters; ε: false
positive rate
Output: filters of all the groups of similar RDF graphs
 1: Let G(V, E) be initialized to an empty graph
 2: for each PV V_i do
 3:   Add a new vertex v_i to V
 4:   for each r ∈ P do
 5:     {h_i1, ..., h_ik} ← LSH_{k,l,m}(V_{i,r})
 6:     for every v_j ∈ V and i ≠ j do
 7:       if ∃ o s.t. 1 ≤ o ≤ k and h_io = h_jo then
 8:         Add an edge (v_i, v_j) to E if not already present
 9: Compute the connected components of G. Let
    {C_1, ..., C_t} denote these components.
10: for i = 1 to t do
11:   Compute the union U_i of all PVs corresponding to the
      vertices in C_i
12:   Construct a BF for U_{i,SPO} with false positive rate ε
      given the capacity |U_{i,SPO}|
13:   Construct a CBF for each of the remaining vectors of
      U_i with false positive rate ε given the capacity
      equal to the cardinality of that vector
14:   Store the ids of graphs belonging to C_i
15: return


In Algorithm 1, we outline the steps to construct the
PV-Index. We build a graph G, where each vertex of G
represents a PV. For every PV, we apply LSH on each of its
seven vectors. Suppose there are two PVs such that the
application of LSH on their vectors for the same pattern r
produces at least one identical hash value; then we add an
edge between the vertices representing these PVs (the inner
loops of Algorithm 1). Essentially, a missing edge between
two vertices indicates that their corresponding PVs are
dissimilar with high probability. Once G is constructed, we
compute (in linear time) the connected components in it.
Each connected component represents RDF graphs whose
corresponding PVs are similar with high probability. We
treat these graphs as a group and compute the union of their
PVs (Line 11). The union operation summarizes the PVs as
well as preserves the condition stated in Theorem 1. (The
individual vectors in a PV are sorted to enable the union
operation in linear time.)
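The grouping step can be sketched without materializing the graph G. The sketch below swaps in a union-find structure with hash buckets in place of the explicit pairwise edge test; it yields the same connected components, because all PVs that share a value at the same position for the same pattern end up linked. The input layout (`signatures[i][r]` holding the k LSH values of vector r of the i-th PV) and the function name are assumptions.

```python
def connected_groups(signatures):
    """Group PV indices whose LSH signatures collide for some pattern and position."""
    parent = list(range(len(signatures)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen = {}  # (pattern, position, hash value) -> first PV index with that value
    for i, sig in enumerate(signatures):
        for r, values in sig.items():
            for o, hv in enumerate(values):
                j = seen.setdefault((r, o, hv), i)
                parent[find(i)] = find(j)  # same effect as adding edge (v_i, v_j)

    groups = {}
    for i in range(len(signatures)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy signatures with k = 2 for a single pattern.
sigs = [{"SPO": [1, 2]}, {"SPO": [9, 2]}, {"SPO": [9, 7]}, {"SPO": [5, 5]}]
groups = connected_groups(sigs)
```

Here PVs 0 and 1 collide at position 1, PVs 1 and 2 collide at position 0, and PV 3 collides with nothing, so the first three PVs form one group by transitivity.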

To compactly represent the union computed for a connected
component, we use a combination of one Bloom filter (BF) and
six Counting Bloom filters (CBFs). The vector for the
canonical pattern SPO is stored using a BF and the others
are stored using CBFs. Each filter of a vector is configured
for a false positive rate of ε and capacity equal to the
cardinality of the vector (Lines 12 and 13). For each
connected component, we also store the ids of graphs
belonging to it. In summary, the BFs and CBFs for all the
connected components constitute the PV-Index. Each group of
graphs is separately indexed using a tool like Jena TDB.

Figure 4: An example of a BGP Tree

4.2 Query Processing 

Next, we propose a decrease-and-conquer approach for
efficient SPARQL query processing in RIQ. That is, we first
identify candidate groups of RDF graphs that may contain
matches for a query using the PV-Index and then execute
optimized SPARQL queries on these candidates.

Given a query, the first step is to parse its GRAPH block
according to the grammar in Figure 2 and generate a tree
representation, which we call the BGP Tree. This tree serves
as an execution plan for processing individual BGPs in the
query. (See Figure 4 for an example.) We maintain a Boolean
variable eval[n] for each node n in the tree to denote the
status of the evaluation on a connected component of the
PV-Index. With eval[n] = FALSE for every node in the tree,
we invoke Algorithm 2 on each connected component, starting
from the root of the BGP Tree. When a child of
GroupGraphPattern evaluates to FALSE, we skip processing the
remaining children (Line 6), because the RDF graphs
belonging to that connected component will not produce a
match for the subexpression rooted at GroupGraphPattern. For
GroupOrUnionGraphPattern, however, at least one of its
children, i.e., GroupGraphPattern, should evaluate to TRUE
to produce a match (Line 8).
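The traversal of Algorithm 2 can be sketched as a short recursive function. The `Node` class, its field names, and the `is_match` callback (standing in for the per-BGP filter test on one connected component) are assumptions made for this sketch; the branching follows the node types named in the grammar.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # e.g. "GroupGraphPattern", "BGP", "EXISTS"
    children: list = field(default_factory=list)
    bgp: object = None              # payload, used only by BGP nodes
    status: bool = False            # plays the role of eval[n]

def eval_bgp_tree(n, is_match):
    """Evaluate the BGP Tree rooted at n against one connected component."""
    for c in n.children:
        c.status = eval_bgp_tree(c, is_match)
        if n.kind == "GroupGraphPattern" and not c.status:
            n.status = False
            return False            # skip the remaining children
    if n.kind == "GroupOrUnionGraphPattern":
        n.status = any(c.status for c in n.children)
    elif n.kind == "EXISTS":
        n.status = n.children[0].status
    elif n.kind in ("NOT EXISTS", "Predicate"):
        n.status = True             # cannot be decided from the filters alone
    elif n.kind == "BGP":
        n.status = is_match(n.bgp)
    else:
        n.status = n.children[-1].status
    if n.kind == "OptionalGraphPattern":
        return True                 # a failed OPTIONAL never rules out a group
    return n.status

def bgp(name):
    return Node("BGP", bgp=name)

def group(*kids):
    return Node("GroupGraphPattern", list(kids))

# Hypothetical tree: a required BGP, an OPTIONAL block, and a two-way UNION.
root = group(
    bgp("a"),
    Node("OptionalGraphPattern", [group(bgp("b"))]),
    Node("GroupOrUnionGraphPattern", [group(bgp("c")), group(bgp("d"))]),
)
is_candidate = eval_bgp_tree(root, lambda q: q in {"a", "d"})
```

In this toy tree the OPTIONAL block and the first UNION branch fail the filter test, yet the component remains a candidate because the required BGP and the second UNION branch pass.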

When a BGP is encountered (Line 15), we test the necessary
condition stated in Theorem 1 by calling Algorithm 3. This
involves the processing of membership queries on the BF and
CBFs constructed for that connected component. If
OptionalGraphPattern evaluates to FALSE, we return TRUE
because of the semantics of OPTIONAL in SPARQL. If
eval[root] = TRUE, then the group of RDF graphs belonging to
that connected component is a candidate for further
processing.
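The filter comparison performed for a BGP can be sketched as a position-by-position test. The sketch assumes that both the query filters and the group filters are available as plain lists of integers of equal length per canonical pattern (0/1 bits for SPO, counters for the rest) and that they were built with identical sizes and hash functions; the function name and the toy values are hypothetical.

```python
def is_match(query_filters, group_filters):
    """Return True if every bit/counter set by the query is covered by the group."""
    for r, fq in query_filters.items():
        fu = group_filters[r]
        # For SPO this is a bit test; for the other patterns a counter test.
        if any(cq > cu for cq, cu in zip(fq, fu)):
            return False
    return True

# Toy filters of length 4 for two patterns.
group_f = {"SPO": [1, 0, 1, 0], "?P?": [3, 0, 2, 1]}
ok = is_match({"SPO": [1, 0, 0, 0], "?P?": [2, 0, 1, 0]}, group_f)
too_many = is_match({"SPO": [0, 0, 0, 0], "?P?": [4, 0, 0, 0]}, group_f)
```

Since a single comparison handles both cases, a bit set in the query but unset in the group and a query counter exceeding the group counter are rejected in the same way.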

For the candidate, an optimized SPARQL query can be
generated by traversing the BGP Tree and checking the
evaluation status of each node. (In the interest of space,
we provide the algorithm in the technical report [26].)

The result modifiers and predicates within FILTER are
included in the optimized query. In Figure 4, we show an
example, where the OPTIONAL block and one block in the UNION
will be discarded in the optimized query. The optimized
query can then be executed on the candidate using a tool
like Jena TDB. The results from all the candidates should be
combined to produce the final results.

Algorithm 2 EvalBGPTree(node n, conn. component j)

 1: Let c_1, ..., c_r denote the child nodes of n (left-to-right)
 2: for i = 1 to r do
 3:   eval[c_i] ← EvalBGPTree(c_i, j)
 4:   if n is GroupGraphPattern & eval[c_i] = FALSE then
 5:     eval[n] ← FALSE
 6:     return FALSE {//skip rest of the nodes}
 7: if n is GroupOrUnionGraphPattern then
 8:   eval[n] ← eval[c_1] ∨ ... ∨ eval[c_r]
 9: else if n is EXISTS then
10:   eval[n] ← eval[c_1]
11: else if n is NOT EXISTS then
12:   eval[n] ← TRUE
13: else if n is Predicate then
14:   eval[n] ← TRUE {//skip processing predicates}
15: else if n is BGP then
16:   Let q denote the basic graph pattern
17:   eval[n] ← IsMatch(q, j)
18: else
19:   eval[n] ← eval[c_r]
20: if n is OptionalGraphPattern then
21:   return TRUE
22: return eval[n]

Algorithm 3 IsMatch(BGP q, conn. component j)

 1: For connected component j, let F_{U_j,r} denote the BF or
    CBF constructed for pattern r
 2: Construct F_{q,r} with the same capacity and false positive
    rate as F_{U_j,r}
 3: if (1) for each bit in F_{q,SPO} set to 1, the
    corresponding bit in F_{U_j,SPO} is 1, and (2) for each
    of the remaining patterns, given a non-zero counter in
    F_{q,r}, the corresponding counter in F_{U_j,r} is
    greater than or equal to it then
 4:   return TRUE, otherwise return FALSE


5. PERFORMANCE EVALUATION

In this section, we report the initial performance
evaluation of RIQ and compare it with the latest version of
RDF-3X and Apache Jena 2.11.1 (TDB). RDF-3X and Jena TDB
readily index datasets with more than a billion triples.
Also, Jena supports RDF quads. We ran all the experiments on
a 64-bit Ubuntu 12.04 machine with 4 Intel Xeon 2.4GHz cores
and 16GB RAM. RIQ uses popular open-source libraries for
parsing RDF data [11] and constructing BFs and CBFs. All
three approaches were single-threaded.

5.1 Datasets and Queries

We used one synthetic and one real dataset in our
experiments. The synthetic dataset was generated using the
Lehigh University Benchmark (LUBM) and contained 1.38
billion triples, 18 unique predicates, and 10,000
universities. The triples were divided across 200,004 files
and each file was treated as one RDF graph. The real dataset
was BTC 2012, which is widely used in the Semantic Web
community. It contained 1.36 billion RDF quads with 57,000
unique predicates and 9.59 million RDF graphs.

For LUBM, the query set included 3 SPARQL queries with
large, complex BGPs (L1-L3) and 9 others (L4-L12) that are
variations of the queries in the LUBM benchmark. For BTC
2012, the query set also included 2 SPARQL queries with
large, complex BGPs (B1, B2) and 5 others (B3-B7). (In the
interest of space, the queries are listed in the technical
report [26].) The number of triple patterns in each query
and the number of results obtained for each query using the
three approaches are shown in Table 2.

5.2 Index Size

Here we report the size of the indexes built by the three
approaches. For LUBM, the sizes of the indexes built by
RDF-3X and Jena TDB were 77 GB and 121 GB, respectively. The
filtering index of RIQ was 8.5 GB in size and had 339
unions. For BTC 2012, the sizes of the indexes built by
RDF-3X and Jena TDB were 87 GB and 110 GB, respectively.
RIQ's filtering index was 16 GB in size and had 2620 unions.
Note that the sizes of the LUBM and BTC 2012 datasets were
217 GB and 218 GB, respectively. When constructing the
filters of both datasets in RIQ, we set the false positive
rate ε equal to 5%.

5.3 Query Processing

We measured the wall-clock time taken to process each query
in both cold and warm cache settings, and report the average
over 3 runs in Table 2. Jena TDB was executed with its
default statistics-based optimization.

For LUBM, RIQ processed queries with large, complex BGPs
(L1-L3) significantly faster than RDF-3X and Jena TDB in
both cold and warm cache settings. For BTC 2012, RIQ was
significantly faster than RDF-3X in processing queries B1
and B2. This demonstrates that the decrease-and-conquer
approach of RIQ is more effective than the popular
join-based processing (by first matching individual triple
patterns) on queries with large, complex BGPs. All of the
large, complex queries had at least one undirected cycle.
RIQ identified a maximum of 22 candidate groups for queries
L1-L3 and 4 candidate groups for queries B1 and B2.

Next, we report the performance of RIQ on queries with
(small) BGPs containing fewer than 8 triple patterns (L4-L12
and B3-B7). Interestingly, on LUBM, RIQ was faster than
RDF-3X and Jena TDB for six out of the nine queries in both
cold and warm cache settings. On BTC 2012, RIQ was the
fastest in the cold cache setting for three out of the five
queries. However, RDF-3X was the fastest in the warm cache
setting for four out of the five queries. Finally, we
compared the three approaches based on the geometric mean of
their query processing times. Clearly, RIQ was the winner
for both LUBM and BTC 2012.


6. CONCLUSIONS 

We presented RIQ, a new approach for indexing large RDF
datasets containing quads. RIQ employs a
decrease-and-conquer approach to efficiently process SPARQL
queries. Through our experiments, we demonstrate that RIQ
enables efficient SPARQL query processing on large RDF
datasets with more than a billion quads.

Table 2: Query processing times (in seconds) for LUBM and BTC-2012. Best times are shown in bold within
shaded cells. indicates that the query ran for more than X seconds and was terminated.